Evaluating the ability of a predictive vision-based machine learning model to measure changes in gait in response to medication and DBS within individuals with Parkinson’s disease

Introduction
Gait impairments in Parkinson’s disease (PD) are treated with dopaminergic medication or deep-brain stimulation (DBS), although the magnitude of the response is variable between individuals. Computer vision-based approaches have previously been evaluated for measuring the severity of parkinsonian gait in videos, but have not been evaluated for their ability to identify changes within individuals in response to treatment. This pilot study examines whether a vision-based model, trained on videos of parkinsonism, is able to detect improvement in parkinsonian gait in people with PD in response to medication and DBS use.

Methods
A spatial–temporal graph convolutional model was trained to predict MDS-UPDRS-gait scores in 362 videos from 14 older adults with drug-induced parkinsonism. This model was then used to predict MDS-UPDRS-gait scores on a different dataset of 42 paired videos from 13 individuals with PD, recorded while ON and OFF medication and DBS treatment during the same clinical visit. Statistical methods were used to assess whether the model was responsive to changes in gait in the ON and OFF states.

Results
The MDS-UPDRS-gait scores predicted by the model were lower on average (representing improved gait; p = 0.017, Cohen’s d = 0.495) during the ON medication and DBS treatment conditions. The magnitude of the differences between ON and OFF state was significantly correlated between model predictions and clinician annotations (p = 0.004). The predicted scores were significantly correlated with the clinician scores (Kendall’s tau-b = 0.301, p = 0.010), but were distributed in a smaller range as compared to the clinician scores.

Conclusion
A vision-based model trained on parkinsonian gait did not accurately predict MDS-UPDRS-gait scores in a different PD cohort, but detected weak but statistically significant proportional changes in response to medication and DBS use. Large, clinically validated datasets of videos captured in many different settings and treatment conditions are required to develop accurate vision-based models of parkinsonian gait.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12938-023-01175-y.

In our experiments, there was a significant difference in the mean MDS-UPDRS-gait score predicted by the model in the ON and OFF conditions when evaluating with each of the three pose-estimation libraries (Table A-1). The Kendall τB coefficient is moderate and statistically significant for the correlation between the discrete clinician-annotated scores and the continuous ML model predictions for all three pose-estimation libraries (Table A-2). However, the correlation between the magnitude of change between ON and OFF states as rated by the clinicians and as predicted by the model is statistically significant only for the Detectron pose-estimation library.
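The Kendall τB statistic referred to above can be illustrated with a minimal pure-Python sketch. It is not the authors' analysis code; the function name `kendall_tau_b` and the example scores are hypothetical, and only the coefficient (not its p-value) is computed. The tie correction matters here because the clinician scores are discrete and therefore contain many ties.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences.

    tau_b = (P - Q) / sqrt((P + Q + T) * (P + Q + U)), where P and Q are the
    numbers of concordant and discordant pairs, T counts pairs tied only in x,
    and U counts pairs tied only in y. Pairs tied in both are ignored.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical data: discrete clinician scores and continuous model outputs.
clinician_scores = [0, 1, 1, 2, 2, 3]
model_scores = [0.4, 0.9, 0.7, 1.3, 1.5, 1.4]
tau_b = kendall_tau_b(clinician_scores, model_scores)
```

In practice a statistics library would normally be used so that the significance value is obtained alongside the estimate.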

Conclusion
When evaluating the ST-GCN ML MDS-UPDRS-gait prediction model on the PD dataset available for this study, the Detectron library yielded the most statistically significant results for the metrics assessed.

Appendix B - Training on both the DIP and PD Dataset in LOSOCV
The main manuscript assessed the performance of the proposed ML model in predicting MDS-UPDRS-gait scores on an unseen PD cohort after being trained solely on a dataset of individuals with DIP. A further experiment was conducted to evaluate whether the addition of data from the PD dataset in the training set would improve performance.

Methods
The same ST-GCN ML model described in the main manuscript was trained on the entire DIP cohort and the PD cohort in a leave-one-subject-out cross-validation (LOSOCV) scheme, ensuring that the individual on whom the model was being evaluated was not included in the training data. The same training parameters, data normalization, and data augmentation approaches were used as for the models that were trained on only the DIP dataset.
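The splitting scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function `losocv_folds`, the `(cohort, subject_id)` tuple layout, and the subject identifiers are all hypothetical. It assumes that each PD participant is held out in turn while every DIP walk stays in the training set.

```python
def losocv_folds(walks):
    """Yield (held_out_subject, train_indices, test_indices) per PD subject.

    walks: list of (cohort, subject_id) tuples, one entry per walk video.
    All DIP walks and the walks of the other PD subjects form the training
    set; the walks of the held-out PD subject form the test set.
    """
    pd_subjects = sorted({subject for cohort, subject in walks if cohort == "PD"})
    for held_out in pd_subjects:
        test_idx = [i for i, (cohort, subject) in enumerate(walks)
                    if cohort == "PD" and subject == held_out]
        train_idx = [i for i, (cohort, subject) in enumerate(walks)
                     if not (cohort == "PD" and subject == held_out)]
        yield held_out, train_idx, test_idx


# Hypothetical toy dataset: three DIP walks and three PD walks.
walks = [("DIP", "d01"), ("DIP", "d01"), ("DIP", "d02"),
         ("PD", "p01"), ("PD", "p01"), ("PD", "p02")]
folds = list(losocv_folds(walks))
```

Splitting by subject rather than by walk is what prevents walks from the evaluated individual from leaking into the training data.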

Results
Table B-1 presents the macro-averaged precision, recall, and F1-score, as well as the mean predicted MDS-UPDRS-gait score in the ON and OFF states, on the test set for the model trained on the combined DIP and PD datasets (in a LOSOCV manner).
Table B-2 presents the Kendall τB estimates and p-values for the correlations between the model-predicted and clinician-annotated scores, as well as for the correlations between the differences in ON and OFF states as rated by the model and clinicians. Figure B-1 displays the trends in model-predicted MDS-UPDRS-gait scores in the ON and OFF states when paired by participant and clinical visit. The correlations and differences between ON and OFF states are not statistically significant, with the exception of the differences in ON/OFF state mean MDS-UPDRS-gait predictions when using the OpenPose library.
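For readers unfamiliar with macro-averaging, the sketch below shows how macro-averaged precision, recall, and F1-score can be computed. It is illustrative only: the function `macro_prf` and the example values are hypothetical, and it assumes that continuous model outputs are rounded to the nearest integer class before scoring, which the text does not state. Per-class metrics with a zero denominator are set to 0 here, which is one common convention.

```python
def macro_prf(y_true, y_pred, labels):
    """Return macro-averaged (precision, recall, F1) over the given labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical clinician labels and continuous model outputs.
clinician = [0, 1, 1, 2, 2, 3]
continuous = [0.4, 0.9, 1.6, 1.3, 2.2, 1.8]
rounded = [round(score) for score in continuous]  # assumed discretization
precision, recall, f1 = macro_prf(clinician, rounded, labels=[0, 1, 2, 3])
```

Because every class contributes equally to the macro average, a class that the model never predicts (such as score 3 in this toy example) pulls all three metrics down.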

Conclusion
Based on our experiments, the addition of training data from the PD cohort worsened performance of the model significantly when evaluating on unseen examples from this dataset. It is hypothesized that the inclusion of data from a different clinical population, which was collected in a different environment and annotated by different clinicians, provided conflicting information to the model with respect to what a prototypical walk of each clinical score category should look like. We hypothesize that, due to the differences in the labels and input data, data points from the smaller PD cohort served as outliers or anomalous points and provided conflicting information to the model, which was otherwise trained on a larger cohort of data points from the DIP dataset. This suggests that the proposed model is not able to identify and distill the identifying characteristics of each MDS-UPDRS-gait score class well when trained on a small number of examples from two different datasets. Interestingly, the range of scores predicted by the model was larger when the PD walks were included in the training set (Figure B-1 vs Figure A-1), suggesting that the model learned to make predictions at this higher range through the inclusion of examples of the walks with clinician-annotated scores of 3 available in the PD dataset. Future work will examine whether also training on walks from the dataset on which the model will be evaluated is beneficial when larger datasets are available.

Appendix C - Evaluation of OF-DDNet Model
In addition to the ST-GCN model presented in the main text, we explored the ordinal focal neural double-feature, double-motion network (OF-DDNet) proposed by Lu et al. [2]. This file presents the additional methods and results associated with this model.

Methods
As with the ST-GCN model, the OF-DDNet model was trained on the 2D joint trajectories extracted from the pose-estimation libraries (AlphaPose, Detectron, OpenPose). Similarly, the DIP dataset was used for training and the model was evaluated on the MDC dataset of adults with PD.
Unlike for the ST-GCN model, the input data was not normalized as the internal features extracted by the OF-DDNet are location and viewpoint invariant [2]. However, the model was adapted to predict continuous UPDRS-gait scores by multiplying the predicted logits by the corresponding scores and summing for all scores.
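The conversion from per-class outputs to a continuous score described above amounts to a weighted sum, which can be sketched as follows. This is not the authors' implementation: the function names are hypothetical, and the sketch assumes the raw outputs are first normalized with a softmax so that the weights sum to one, which the text does not specify. The score values are passed as a parameter because the set of score classes used for training is not stated here.

```python
from math import exp


def softmax(logits):
    """Numerically stable softmax over a list of raw class outputs."""
    largest = max(logits)
    exps = [exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def continuous_score(logits, scores):
    """Weight each candidate score by its (assumed softmax) probability and sum."""
    if len(logits) != len(scores):
        raise ValueError("one logit is required per score class")
    return sum(p * s for p, s in zip(softmax(logits), scores))


# Hypothetical example with three score classes (0, 1, 2).
example = continuous_score([0.2, 1.5, 0.1], scores=(0, 1, 2))
```

With normalized weights the result always lies between the lowest and highest score class, so a model trained without examples of the highest scores cannot predict beyond the range it has seen.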

Results
In our experiments, the "middle" weight configuration provided in the source code of the original DDNet work [3] yielded the best results and was thus used for the results presented below. Unlike the ST-GCN results presented in Figure 2 of the main text, there is no clear downward trend in UPDRS-gait scores predicted by the OF-DDNet when moving from OFF to ON treatment conditions.
Table C-1 presents the F1-scores for the OF-DDNet model when trained on the DIP dataset and evaluated on the PD dataset.
The mean predicted scores during ON and OFF treatment conditions, as well as the significance values for a one-tailed paired t-test assessing whether the predicted OFF condition score is higher than the ON condition score for walks paired by participant and clinical visit, are also presented in Table C-1. Consistent with the ST-GCN model, the F1-scores for the OF-DDNet model were low, indicating that the output of the model does not accurately predict MDS-UPDRS-gait scores. However, unlike for the ST-GCN model (Table 2, main text), the predicted OFF treatment scores were not significantly higher than the ON treatment scores for the OF-DDNet model.
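The paired comparison of OFF and ON predictions described above can be sketched with the standard library. The function `paired_t_statistic` and the example scores are hypothetical, and the sketch computes only the test statistic and degrees of freedom; the one-tailed p-value would then be read from a t-distribution, for example with a library routine such as `scipy.stats.ttest_rel` called with `alternative="greater"`.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(off_scores, on_scores):
    """Return (t, degrees_of_freedom) for the paired differences OFF - ON.

    A large positive t supports the one-tailed hypothesis that the predicted
    score is higher (worse gait) in the OFF state than in the ON state.
    """
    if len(off_scores) != len(on_scores):
        raise ValueError("scores must be paired walk-for-walk")
    diffs = [off - on for off, on in zip(off_scores, on_scores)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # stdev uses the n - 1 denominator
    return mean(diffs) / standard_error, n - 1


# Hypothetical predictions for four walks paired by participant and visit.
off_pred = [1.2, 1.0, 1.4, 0.9]
on_pred = [1.0, 1.1, 1.1, 0.8]
t_stat, dof = paired_t_statistic(off_pred, on_pred)
```

Pairing by participant and clinical visit removes between-person differences in baseline severity, so the test is sensitive only to the within-person change.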
Table C-2 presents Kendall's Tau-b correlation coefficient (τB) and p-value for significance of the correlation in the difference in UPDRS-gait scores as labelled by the clinician and by the OF-DDNet model.

Conclusion
In our experiments, the OF-DDNet model did not predict significantly lower UPDRS-gait scores during ON than during OFF treatment states (Table C-1). The raw scores and the differences between the scores predicted during ON and OFF treatment conditions by the OF-DDNet model were not significantly correlated with the scores annotated by the clinicians for any of the pose-estimation libraries (Table C-2). Unlike the ST-GCN model, the OF-DDNet model was not responsive to treatment condition and magnitude on the dataset evaluated in this study.

Appendix D - Evaluation of ST-GCN by Repetition
When training the ST-GCN model evaluated in this study, large differences were noted in the raw MDS-UPDRS-gait values predicted by the model after training when initialized with different seeds. For this reason, the model was trained from scratch five times (with different starting seeds) and the model performance in the main manuscript was reported using the mean prediction for each walk across all five repetitions. However, a measure of the reliability of the model is the stability of the main conclusion over each repetition. For example, a robust and reliable model should yield the same conclusion (i.e., identification of improvement, no change, or worsening parkinsonism in gait) when evaluated on each repetition.

Results and Discussion
Table D-1 presents the predicted direction of change for each repetition of the model, the mean model prediction (as reported in the main manuscript), the direction of change as annotated by each rater, as well as the mean clinician annotation in the ON and OFF state and direction of change for each paired walk.
From Table D-1, it was observed that the model-predicted direction of change was relatively stable across each model training repetition. For all paired walks, any deviation in the direction of change across different model training repetitions was between adjacent categories (i.e., no row contained both a repetition indicating an "increase" and a repetition indicating a "decrease" in predicted MDS-UPDRS-gait score). Furthermore, in 13 of the 21 paired assessments, the model-predicted direction of change was consistent across all five repetitions. This suggests that even across different weight initializations and other stochasticity associated with training, the models are able to learn similar features of gait across repetitions such that the direction of their predictions on unseen walk pairs is relatively consistent. In our experiments, the number of pairs for which the model-predicted direction of change was congruent with the mean clinician direction of change varied across repetitions (ranging from 11 to 13 out of the 21 pairs). The number of assessments where at least one of the repetitions yielded a model-predicted direction of change congruent with the mean clinician-annotated direction is slightly higher at 16 out of 21.
From Table D-1, it can be observed that the five trials where the model predicted an increase in MDS-UPDRS-gait scores between the OFF and ON treatment condition corresponded to the trials which were not consistent with the mean clinician-annotated direction of change. Of the five trials where the model predicted an increase in MDS-UPDRS-gait score, two of the trials had a mean clinician-annotated MDS-UPDRS-gait score of 3 (severe impairment). Individuals with this level of impairment were not present in the DIP training set, so it is likely that the model underestimated the level of parkinsonism in the OFF state, and thus always predicted a higher score in the ON state. This hypothesis is supported by Figure D-1, in which the red lines (representing an increase in MDS-UPDRS-gait score when moving from OFF to ON state) are generally on the lower end of the range in the OFF state. Furthermore, in one of the five trials where there was an increase in model-predicted score when moving from OFF to ON treatment, one of the clinicians also noted an increase in their annotated MDS-UPDRS-gait score (Table D-1, fourth row from bottom).
Overall, the direction of change between ON/OFF states as predicted by the ST-GCN model is generally consistent between repetitions, suggesting the stochasticity of training does not impede the key features learned by the model.

Figure A-1. Spaghetti plots of MDS-UPDRS-gait scores predicted by the model in ON and OFF states, grouped by patient and clinical visit. The red lines represent the mean prediction for each treatment condition.
Comparing Table B-1 to Table A-1, Table B-2 to Table A-2, and Figure B-1 to Figure A-1, it is evident that the addition of the training data from the PD dataset yielded poorer results when evaluating on the unseen data in the PD dataset.

Figure B-1. Spaghetti plots of MDS-UPDRS-gait scores predicted by the model in ON and OFF states, grouped by patient and clinical visit. Green lines indicate paired walks where the MDS-UPDRS-gait score was higher in the OFF state than the ON state (indicating improvement in gait ON treatment), while red lines denote the pairs where the MDS-UPDRS-gait score was higher ON treatment (indicating worsening gait). Yellow is used to denote pairs where no change was noted between the two treatment conditions. The navy lines represent the mean prediction for each treatment condition.

Figure C-1 presents the UPDRS-gait scores predicted by the OF-DDNet model when trained on the DIP dataset and evaluated on the PD dataset. The results are grouped by patient and clinical visit, and are presented during ON and OFF treatment conditions.

Figure C-1. Spaghetti plots of MDS-UPDRS-gait scores predicted by the model in ON and OFF states, grouped by patient and clinical visit. Green lines indicate paired walks where the MDS-UPDRS-gait score was higher in the OFF state than the ON state (indicating improvement in gait ON treatment), while red lines denote the pairs where the MDS-UPDRS-gait score was higher ON treatment (indicating worsening gait). Yellow is used to denote pairs where no change was noted between the two treatment conditions. The navy lines represent the mean prediction for each treatment condition.

Figure D-1. Spaghetti plots of MDS-UPDRS-gait scores predicted by the model in the ON and OFF states, grouped by patient and clinical visit, and presented by model training repetition. Green lines indicate paired walks where the MDS-UPDRS-gait score was higher in the OFF state than the ON state (indicating improvement in gait ON treatment), while red lines denote the pairs where the MDS-UPDRS-gait score was higher ON treatment (indicating worsening gait). Yellow is used to denote pairs where no change was noted between the two treatment conditions. The navy lines represent the mean prediction for each treatment condition.

TABLE B-1
Macro-averaged Precision, Recall, F1-Score, Mean MDS-UPDRS-Gait Score Prediction During ON and OFF States and Paired T-Test Significance Value for the ST-GCN model trained on the DIP and PD Dataset in a LOSOCV Manner

Pose-estimation library | Correlation of MDS-UPDRS-gait score predicted by ML model and annotated by clinicians (Kendall τB estimate, Kendall τB p-value) | Correlation of difference between ON/OFF state as rated by clinicians and ML model (Kendall τB estimate, Kendall τB p-value)

TABLE C-1
Macro-averaged Precision, Recall, F1-Score, Mean MDS-UPDRS-Gait Score Prediction During ON and OFF States and Paired T-Test Significance Value for the OF-DDNet Model

TABLE C-2
One-tailed Kendall τB Estimates and P-Values for Correlation Strength Between Model-Predicted and Clinician-Annotated MDS-UPDRS-Gait Values and Differences in Scores Between ON and OFF States for the OF-DDNet Model

TABLE D-1
Direction of Change for Each Repetition of the ST-GCN Model Trained, Each Clinician, and Mean Clinician-Annotated MDS-UPDRS-gait Score
Direction of Change in Model-Predicted MDS-UPDRS-gait Score Between OFF and ON States | Direction of Change in Clinician-Annotated MDS-UPDRS-gait Score Between OFF and ON States